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Reconstruction of gene regulatory networks (GRNs) from experimental data is a funda- 
mental challenge in systems biology. A number of computational approaches have been 
developed to infer GRNs from mRNA expression profiles. However, expression profiles 
alone are proving to be insufficient for inferring CRN topologies with reasonable accu- 
racy. Recently, it has been shown that integration of external data sources (such as gene 
and protein sequence information, gene ontology data, protein-protein interactions) with 
mRNA expression profiles may increase the reliability of the inference process. Here, I 
propose a new approach that incorporates transcription factor binding sites (TFBS) and 
physical protein interactions (PPI) among transcription factors (TPs) in a Bayesian variable 
selection (BVS) algorithm which can infer GRNs from mRNA expression profiles subjected 
to genetic perturbations. Using real experimental data, I show that the integration of TFBS 
and PPI data with mRNA expression profiles leads to significantly more accurate networks 
than those inferred from expression profiles alone. Additionally, the performance of the 
proposed algorithm is compared with a series of least absolute shrinkage and selection 
operator (LASSO) regression-based network inference methods that can also incorporate 
prior knowledge in the inference framework. The results of this comparison suggest that 
BVS can outperform LASSO regression-based method in some circumstances. 

Keywords: network inference, Bayesian statistics, data interpretation, statistical, variable selection, gene regulatory 
networks 



INTRODUCTION 

Understanding how genes regulate each other to orchestrate cellu- 
lar phenotypes is a fundamental challenge of Biology. A straight- 
forward way of uncovering gene regulatory networks (GRNs) is 
to perturb each gene of the network, e.g. by means of siRNAs and 
chemical inhibitors, and measure the effects of these perturbations 
on the expression of other genes in the network (Kholodenko et al., 
2002; Wagner, 2002). However, the effects of such perturbations 
rapidly propagate through the entire network, causing widespread, 
global changes in the gene expressions, making it challenging to 
differentiate the direct interactions from the indirect ones. Several 
computational approaches were proposed to unmask the direct 
gene regulatory interactions by systematically analyzing pertur- 
bation responses (Kholodenko et al., 2002; Repsilber et al., 2002; 
Wagner, 2002; Gardner et al., 2003; Hartemink, 2005; Rogers and 
Girolami, 2005; de la Fuente and Makhecha, 2006; Margolin et al., 
2006; Bansal et al., 2007). Many of these studies found that the 
steady-state perturbation responses of a gene are linearly depen- 
dent on the same of its direct regulators (Kholodenko et al., 2002; 
Gardner et al., 2003; Rogers and Girolami, 2005; de la Fuente and 
Makhecha, 2006; Bansal et al., 2007). These findings presented a 
unique opportunity of identifying direct genetic interactions by 
simply solving a set of linear equations. Although this approach 
seems simple in theory, implementing it in practice is not straight- 
forward. First, biological measurements are noisy and contain 
experimental errors. The noise in biological datasets may cause 
significant errors while reconstructing GRNs by solving linear 



equations. Second, and perhaps most importantly, in order to 
solve these linear equations, one needs to perturb a GRN at least 
as many times as the number of genes in the network and measure 
the responses of all its genes after each perturbation (Kholodenko 
et al, 2002; Gardner et al, 2003; Rogers and Girolami, 2005; 
de la Fuente and Makhecha, 2006; Bansal et al., 2007). There- 
fore, reconstructing genome scale GRNs using the above method 
requires thousands (for simple organisms, e.g. bacteria, fungus, 
etc.) and often tens of thousands (for complex organisms such as 
mammals) of perturbation experiments that are time consuming 
and expensive. Most perturbation experiments, except those per- 
formed in some simple model organisms such as Escherichia coli 
(Baba et al., 2008) or yeast (Hughes et al., 2000), involve far fewer 
perturbations than the number of genes in the GRN. As a result, 
the datasets produced by these experiments do not have enough 
information for a full reconstruction (by solving linear equations) 
of the corresponding GRNs. Several statistical algorithms have 
been proposed to resolve this issue. For instance, some authors 
used singular value decomposition and linear regression (Yeung 
et al., 2002; Guthke et al., 2005; Zhang et al, 2010) to reconstruct 
GRNs using datasets obtained from a small number of pertur- 
bation experiments. Huang et al. (2010) used regulator filtering, 
forward selection, and linear regression for GRN reconstruction; 
and Imoto et al. (2003) used non-parametric regression embed- 
ded within a Bayesian network for the same purpose. Several other 
regression techniques such as the elastic net (Zou and Trevor, 2005; 
Friedman et al., 2010) and least absolute shrinkage and selection 
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operator (LASSO; van Someren et al, 2003; Li and Yang, 2004; van 
Someren et al, 2006; Shimamura et al, 2007; Hecker et al, 2009, 
2012; Lee et al, 2009; Charbonnier et ah, 2010; Gustafsson and 
Hornquist, 2010; James et al, 2010; Pan et al, 2010; Peng et al, 
2010; Wang et al., 2013) have also been widely used to reconstruct 
GRNs from noisy and insufficient perturbation response data. 

Although many of these algorithms perform reasonably well, it 
is being increasingly clear that the accuracy of these algorithms can 
be significantly increased by integrating external data sources, e.g. 
gene sequence, single nucleotide polymorphism (SNP), protein- 
protein interaction (PPI), etc., in the network reconstruction 
process (Yeung et al, 2011; Lo et al., 2012). Public data reposi- 
tories provide a rich resource of biological data related to gene 
regulation. Integrating data from these external data sources into 
network inference algorithms has become a primary focus of the 
systems and computational biology community. Previously, James 
et al. (2010) incorporated documented transcription factor bind- 
ing sites (TFBS) information to infer the GRN of E. coli. Djebbari 
and Quackenbush (2008) used preliminary GRN derived from 
PubMed indexed literature and PPI databases as prior knowl- 
edge for their Bayesian network reconstruction algorithm. Zhu 
et al. (2004) combined TFBS and PPI data to infer GRNs. Imoto 
et al. (2003) used PPI, documented TFBS, and well studied path- 
ways as prior information for their network inference method. 
Lee et al. (2009) presented a systematic way to incorporate various 
types of biological knowledge, such as the gene ontology (GO) 
annotations, data from ChlP-ChIP experiments, and a compre- 
hensive collection of information about sequence polymorphisms. 
Yeung et al. (2005), Yeung et al (2011), and Lo et al. (2012) 
developed a Bayesian model averaging approach to systematically 
integrate publicly available TFBS data, ChlP-ChIP data, physical 
interactions, genetic interactions, additional expression data, and 
literature curation. 

This study is an extension of our previous work (Santra et al, 
2013) which used a Bayesian framework that was designed to 
reconstruct biochemical networks by analyzing steady-state per- 
turbation response data. In our previous study, we used Bayesian 
variable selection (BVS) algorithm to account for model uncer- 
tainty under noisy and insufficient data. Only generic topological 
knowledge such as sparsity of biochemical networks was used 
as prior information in the network reconstruction process. No 
external knowledge regarding potential interactions between net- 
work components was used to guide the inference process. The 
contributions of this study are four folds. First, a simple and an 
intuitive technique is proposed to incorporate external knowl- 
edge into the BVS framework in the form of a prior distribution. 
Second, a new way of integrating protein interactions among tran- 
scription factors (TFs) into the network inference framework is 
proposed. Although, PPI data were used previously (Zhu et al, 
2008) in the context of GRN inference, the approach used by pre- 
vious researchers was very different from the approach used in 
this study. For instance, protein interactions among target genes 
were used by Zhu et al. (2008) to determine co-regulation of 
multiple genes. Here, we use protein interaction among TFs to 
determine combinatorial regulations by multiple TFs. Third, as a 
proof of concept, the proposed methodology is applied to a gene 
expression dataset obtained from a liver-enriched TF regulatory 



network, revealing that it significandy outperforms our previous 
work. Finally, the performance of the proposed method is com- 
pared with a LASSO regression-based network inference method 
using publicly available gene expression datasets. 

The rest of this study is organized as follows. In the next Section 
"Linear Model of Gene Regulation", I briefly discuss linear models 
of gene regulation, followed by a detailed description of the pro- 
posed BVS algorithm in Sections "The Bayesian Variable Selection 
Algorithm" and "Sampling Scheme for the Proposed BVS Frame- 
work." In Section "Integrating External Data to Formulate -P(A')," 
I present a new method of integrating external data sources in the 
BVS formulation. An implementation of this method to infer a 
liver-specific GRN is then discussed in Section "Inferring Liver- 
Specific Gene Regulatory Network from Perturbation Response 
Data." In this section, I also compared the performance of the 
proposed BVS algorithm with our previous work. The results of 
comparing the proposed method with other network inference 
techniques are presented in Section "Inferring GRN of Human 
Breast Epithelium and Comparison with LASSO." Finally, in the 
conclusion section, I discuss the advantages and disadvantages of 
our algorithm and future directions. 

LINEAR MODEL OF GENE REGULATION 

When a GRN is perturbed, the effect of the perturbation rapidly 
propagates through the entire network, causing widespread, global 
changes in the expression levels of its genes. It has been shown 
(Rogers and Girolami, 2005; Bansal et al., 2007; Lo et al., 2012) 
that the responses (x' = {xy,;= 1, . . ., Hp}) of a gene (g,), to a series 
of («p) perturbations, are linearly dependent on the responses 
(X*= {xij, A:= 1, ...,«,-,;'= 1, «p. A: ;'}) of its direct regulators 
i^ = {gk,k=l,...,n„k^i}),i.e., 

x'=X'''p' (1) 

where m; is the number of regulators of the gene (g,), and P' = 
k= I, . . .,n„ i} are the linear coefficients that represent the 
strengths and types of the interactions between the gene (g,) and 
its direct regulators (g*). 

The measurements of the expression levels usually contain 
experimental errors, and may not exactly fit into the above Eq. 1. 
The difference between the left and right hand side of Eq. 1 caused 
by such errors are called the "residuals." In order to compensate 
for errors, the residuals are added to Eq. 1 leading to, 

x'=X''^P' + €' (2) 

where €' = {e^, 1, . . ., rip} represents the residuals caused by 
measurement errors. It can be easily showed that the residual vari- 
ables (6y) are linear combinations of the individual measurement 
errors associated with the perturbation responses of the gene (gi) 
and its regulators (g") (Kariya and Kurata, 2004). Since, the mea- 
surement errors are random in nature, the residual variables are 
also random variables, and by central limit theorem, these vari- 
ables have Gaussian distribution (Kariya and Kurata, 2004). It is 
further assumed that the residual variables («') are independent of 
each other and have 0 mean and variance which depend on the 
extent of experimental/measurement error in the dataset (Rogers 



Frontiers in Bloengineering and Biotechnology | Bioinformatics and Computational Biology 



May 2014 I Volume 2 | Article 13 | 2 



Santra 



Data integration for network inference 



and Girolami, 2005; de la Fuente and Makhecha, 2006; Bansal et al, 
2007; Vignes et al, 201 1; Santra et al, 2013). 

To identify the direct regulators (g*) of the gene (gi), one needs 
to calculate P' by solving Eq. 2 in a least-square sense. The ele- 
ments of P' whose absolute values are significantly >0 are 
then selected as direct interactions, and the corresponding genes 
igk) are considered to be the direct regulators of gi. However, solv- 
ing Eq. 2 requires at least as many perturbations as the number 
of genes («) in the network (Kholodenko et al, 2002; Rogers and 
Girolami, 2005; de la Fuente and Makhecha, 2006; Santra et al, 
2013). Under most circumstances, it is not possible to perform 
so many perturbation experiments, and therefore, in such cases, 
a full GRN reconstruction is not feasible by solving Eq. 2, either 
exactly or in a least-square sense. This issue is resolved by variable 
selection algorithms. 

BAYESIAN VARIABLE SELECTION ALGORITHM 

Variable selection algorithms find the most likely set of regulators 
(^) for each gene (^,) by iteratively solving Eq. 2. It should be noted 
that the inferred interactions between a gene (gi) and its regulators 
(g") may not always represent causal relationships. In many cases, 
these interactions represent "acausal" dependencies between gene 
expressions (Guyon and Elisseeff, 2003). Yet, it has been shown 
that variable selection algorithms can infer gene regulatory pro- 
grams with reasonable accuracy (Yeung et al., 2005, 201 1; Lo et al., 
2012). The mechanism of a simple variable selection technique in 
the context of GRN reconstruction is described below. 

(a) First, a random set of genes (gj) are selected as the potential 
regulators of a gene (gi), and the least-square estimates (^'ij 
of the corresponding interaction strengths and the resulting 
sum of square error (e^P^ = | It^ I calculated. 

(b) At the next iteration, a different set of genes (gj) are selected as 
the potential direct regulators of gene gi, and again, the least- 
square estimates (^'ij °f corresponding interaction strengths 

and the resulting sum of square error (efP^ = ||«2ll^) are 
calculated. 

(c) The newly calculated sum of square error (efP^) is then com- 
pared with the one (e^P^) calculated in the previous iteration. 
If 6^P^ < e^P^, then the new set of potential regulators (gj) is 
considered more likely to directly regulate gi than the previous 
one {g\), otherwise the old set is retained as the most likely 
potential regulators. 

(d) For each gene (g), the above procedure is repeated for all possi- 
ble combination of potential regulators until a set of regulators 
is found that has the minimum sum of squared error. 

The above scheme is simple in theory, but there are some major 
obstacles in implementing it in practice. For instance, if we want 
to reconstruct a GRN involving 1000 genes, then, for each gene, 
we need to iterate through 2^^^ possible combinations of poten- 
tial regulators to find its most likely direct regulators. Iterating 
through so many possible combinations is not feasible even for 
the most advanced computing systems. Therefore, we must adopt 
a smarter strategy to find the most likely set of regulators of each 
gene in a GRN. BVS algorithms (in general) implement efficient 



search strategies to identify the most likely regulators of a gene in a 
reasonable time. Here, I adopted a BVS framework which is similar 
to our previous work (Santra et al., 2013) with a few exceptions. 

To formulate the BVS algorithm, it is convenient to represent 
the topology of a GRN using a binary "adjacency" matrix (A). 
A non-zero entry {Aii^ = 1, /c /= ;') of this matrix represents direct 
regulation of one gene (gi) by another (gj^, A:/:;), whereas the zero 
elements indicate no direct regulation. Consequently, the non- 
zero elements of the !th row (A' = {A,j-, k=l,. . .,n,ky^ i}) oi this 
matrix represent interactions between the gene gi and its direct 
regulators (g"). Note that the binary adjacency matrix (A) and the 
matrix of interaction strengths ( P ) are closely related, since absence 
of direct interaction (A,j- = 0, i k) between two genes (gi, gk) 
implies zero interaction strength (P,fc = 0, i^k). In other words, 
the elements (Pik, i # k) of the interaction strength matrix (P) 
corresponding to the zero elements (A,j- = 0, / k) of the binary 
adjacency matrix (A) are also zero. Therefore, finding the most 
likely direct regulators of a gene (gi) amounts to finding the most 
likely combination of Os and Is in the xth row (A') of the binary 
matrix A. 

To avoid iterating through all possible combinations of A', BVS 
algorithms adopt a Bayesian approach. Bayesian algorithms closely 
mimic the natural learning process of human brain that updates 
its knowledge about certain events when it receives new informa- 
tion about the event. In these algorithms, the prior knowledge 
about a certain event is represented by its prior distribution which 
assigns a prior probability to each possible outcome of the event. 
When new information becomes available, the prior probabilities 
are updated using Bayes' theorem. The updated probability dis- 
tribution is known as the posterior distribution. The posterior 
distributions represent our up-to-date knowledge about a certain 
event based on the data that have been recently available. 

In the context of GRN reconstruction, any prior knowledge 
about the possible regulators (g*) of each gene (g,) is encoded in 
the prior distributions (_P(A')) of the binary vectors A'. In our 
previous work (Santra et al, 2013), we formulated the prior dis- 
tribution P(A') to penalize gene regulation models with too many 
regulators and favored sparse models where each gene is regulated 
by a small number of regulators. No other external knowledge was 
used to formulate the prior distribution of A'. Here, we take a 
different approach and formulate a more informative prior dis- 
tribution of A' by integrating TFBS and PPI between TFs. The 
process of integrating PPI and TFBS data into the prior distribu- 
tion of A' is an important aspect of data integration and will be 
discussed in detail in the next section. 

Prior information about the possible values of the interaction 
strengths (P') is rarely available. In the absence of any specific 
prior knowledge of the possible values of P', it is safe to assume 
that its non-zero elements can take a wide range of positive or 
negative values depending on whether the corresponding inter- 
action is activating or repressing. The zero elements represent 
no direct interaction and correspond to the zero elements of A'. 
This assumption is formulated by assigning a multivariate Gauss- 
ian prior to the non-zero elements of P'. The prior distribution 
of P' is assumed to have zero mean and covariance matrix V^i, 
which is a (n; x «i) matrix that represents our prior knowledge 
about the possible ranges of values of P'. A common approach 
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is to assume that the prior covariance matrix (V^i) of P' is pro- 
portional to the scaled fisher information matrix (FIM) of p', 
i.e. Vpi = ca , where c is the proportionality con- 

stant (also known as Zellner's constant) which determines the 
span of the prior distribution of P' (Zellner, 1986; Ishwaran and 
Rao, 2005; Gupta and Ibrahim, 2009) and is the scaling factor 
which is the same as the variances of the residual variables €y. The 
above formulation of the covariance matrix assumes that the vari- 
ances/covariances of the interaction strengths depend not only on 
the inherent variability of the perturbation responses, but also on 
the variance of the measurement errors. It was shown by other 
researchers that the choice of the proportionality constant c has 
a significant impact on the performance of BVS algorithms and 
several studies were conducted to find the most appropriate value 
of c (George and Foster, 2000; Fernandez et al., 2001; Hansen and 
Yu, 2001; Liang et al, 2008). Fernandez et al. (2001) demonstrated 
that among the commonly used values, c = max (ftp, n^) per- 
forms the best in most scenarios. Therefore, this value was chosen 
for the BVS framework presented in this study. 

The prior knowledge about the noise variance is incorpo- 
rated in its prior distribution. Previously, the noise variance 
was assumed to have a gamma distribution with shape and scale 
parameters, a and p, respectively (Santra et al, 2013). The val- 
ues of these parameters were set to 1 to ensure a flat prior, which 
represents our lack of prior knowledge about extent of noise in 
the dataset. Here, in order to avoid extra hyper parameters (a, P), 
we assumed that has Jeffrey's prior (Fernandez et al, 2001), 
i.e.pia^) ~ ^, which is an uninformative "improper" prior that 
relies on the notion that noises in biological data are unlikely to 
cause very large residuals in the linear models. 

These prior distributions can then be updated to posterior dis- 
tributions based on the measured perturbation responses of the 
network using Bayes formula. Here, we are interested in the pos- 
terior distributions of binary vectors A', i=l,. . ., n, since these 
vectors represent the network topology. It is straightforward to 
show that the posterior distribution (P(A'lx', X)) of A' given the 
perturbation responses of gene gi and its regulators is (Liang et al, 
2008; Note 1 in Supplementary Material) 



'(a'|x',X') 



(1 + c) 



\ + c 



(3) 



here 



is the coefficient of deter- 



mination of the linear model shown in Eq. 2, where P' = 

X' X' ) X' x' is the least-square estimate of P', and x' is the 

sample average of x'. 

Finding the most likely regulators of gene gi is equivalent to 
finding the configuration of A' that maximizes the above posterior 
probability (Eq. 3). But, as discussed before, finding this con- 
figuration requires iterating through all possible configurations 
of A', which is hardly possible for large networks. An alterna- 
tive approach is to estimate the "expected" configuration of A' 



using model averaging techniques that identify a number of "good 
enough" configurations instead of a single "best" configuration. 
The average of these good configurations is commonly used as an 
approximation of the "expected" configuration of A'. The "good 
enough" configurations of A' can be determined in reasonable time 
by drawing samples from the above posterior distribution (Eq. 3) 
using a Markov Chain Monte Carlo (MCMC)-based sampling 
algorithm. 

SAMPLING SCHEME FOR THE PROPOSED BVS FRAMEWORK 

A typical MCMC-based sampling algorithm iteratively explores 
different configurations of A' in order to find those with relatively 
high posterior probability. In each iteration, it calculates the poste- 
rior probability of the current and a proposed new configuration 
of A'. However, in some cases, it is not possible to calculate the 
posterior probability of certain configurations of A'. For instance, 
when n, ^ rip, i.e. the number of Is in A' is larger than the num- 
ber of perturbations, then the corresponding data matrix X has 
dimensions rip x «; and suffers from rank deficiency. Therefore, 

■T 

the Gram matrix X' X' is non-invertible and the correspond- 
ing coefficient of determination (J?) and the posterior probability 
(P(A'lx', X')) do not exist. Previously (Santra et al., 2013), we 

■T 

addressed this issue by adding a diagonal loading {X' X' + S7) to 
the Gram matrix, ensuring its invertibility. However, this approach 
requires the estimation of an optimal value for the loading con- 
stant (S), which adds to the complexity of the sampling process. 
Additionally, the effects of diagonal loading on the overall outcome 
of BVS algorithms are not well understood. In this study, a differ- 
ent strategy is adopted to address the above issue. Here, in order 
to avoid rank deficiency, the search space (^) of the MCMC algo- 
rithm is constrained to only those configurations of A' which has 
less number of Is than the number of perturbations, i.e. n, < tip. 
The restricted search space is denoted by t^,,^ (t,„^ C where 
the subscript rip indicates the upper limit on the number of Is 
in the configurations of A'. The above approach has two major 
advantages over the previous method. First, it ensures the existence 
of the posterior probability without artificial diagonal loading of 
the Gram matrix. Second, it decreases the computational com- 
plexity of the MCMC algorithm by reducing the size of the data 
matrix X. This property makes this approach particularly attrac- 
tive for inferring large GRNs where computational complexity is 
a major issue for MCMC-based variable selection algorithms. The 
computational cost of sampling can be significantly reduced by 
further restricting the search space to an even smaller subspace 
{Kk ^ ?«p)' which contains only those configurations of A' that 
have less than k (where k<np) numbers of Is. Restricting the 
search space to t,i implies that the MCMC algorithm will explore 
regulatory programs (configurations of A') consisting of at most 
fc < Hp regulators for each gene {gi). For accurate network infer- 
ence, it is therefore desirable to assign the restriction parameter (k) 
a reasonable value that is not far from the ground truth. Although, 
there is no easy way of determining an optimal k, one can use prior 
information about the topology of the network to have a broad 
estimate of this parameter. This is discussed in the results section 
where the proposed algorithm is implemented on experimental 
data sets to infer GRNs. In the rest of this section, I continue with 



Frontiers in Bloengineering and Biotechnology | Bioinformatics and Computational Biology 



May 2014 I Volume 2 | Article 13 | 4 



Santra 



Data integration for network inference 



the discussion of the MCMC-based sampHng algorithm, which is 
used in this study to explore the restricted search space of 
potential gene regulatory programs. 

A Metropolis-Hastings algorithm was implemented to sys- 
tematically explore t,k and identify highly probable regulatory 
programs (A'). The sampling algorithm starts with a random 
configuration of A' e i;^. A new configuration A' G is then 
proposed based on a proposal distribution Q. The proposal dis- 
tribution (Q) is formulated as follows. Let r\{A') C ^j. denote a set 
of binary vectors consisting of all possible configurations that can 
be obtained by changing one of the elements of A' from 0 to 1 or 
vice versa. Define a proposal distribution Q as follows. 

q(a',A')= V, . (4) 

^ ^ [o if A' ^Ti(A') 

Based on the above proposal distribution, an acceptance ratio 

p(a''\x'',x')q(a''a') 
a = — ^-7^^ J- is computed. The proposed new confie- 

uration A'' is then accepted with probability min( 1 , a ) . If accepted, 
A' is added to the sequence of drawn samples and becomes 
the current configuration. Else, A' remains the current config- 
uration. Repeating this procedure in an iterative manner gives 
rise to an irreducible Markov chain in the restricted search space 
(^i:). This Markov chain asymptotically converges (Geyer, 20 11 ) to 
the desired posterior (P{A'\x', X')). Upon convergence, the sam- 
ples drawn by the chain resemble those drawn from the posterior 
(_P(A'lx', X')), and therefore, the most probable configurations of 
A' appear more frequently in the drawn samples than the improb- 
able ones. These samples are then used to determine an "average" 
or "expected" regulatory program for each gene (gi). The expected 
probability that a gene (g,) is regulated by another gene (gk) is 
estimated by calculating the ratio of the number (ny) of samples 
whose jth element is 1 to the total number (Wj) of samples, i.e. 
P{Aij = l\Xi,X') = ^ (Muldierjee and Speed, 2008). Calculat- 
ing this probability for each pair of genes results in a probabilistic 
representation of the network topology. 

The above sampling algorithm draws samples from the pos- 
terior distribution of A' (Eq. 3) which depends on its prior 
distribution. This can be exploited to incorporate prior knowl- 
edge from external data sources into the BVS algorithm. To do 
so, the prior distribution (P(A')) needs to be formulated in such 
a way that it favors the likely interactions supported by external 
data sources. This will bias the posterior of A' toward the interac- 
tions that are supported by external data. Below, I show a scheme 
that integrates TFBS with PPI information to formulate the prior 
distribution -P(A'). 

INTEGRATING EXTERNAL DATA TO FORMULATE P{A') 

Genes regulate each other via several mechanisms, e.g. transcrip- 
tional regulation, methylation, histone acetylation, etc. Among the 
known mechanisms of gene regulation, transcriptional regulation 
via TFs is perhaps the most well-studied gene regulatory mecha- 
nism. In the case of transcriptional regulation, proteins produced 
by regulatory genes undergo post-translational modifications and 
then either directly bind to the promoter regions of target genes 



or form multi-protein transcription factor complexes (TFCs) that 
bind to the gene promoters and regulate the activity of the corre- 
sponding genes. The regulatory proteins and TFCs bind genes at 
specific locations containing specific nucleotide sequences, com- 
monly referred to as TFBS. These binding sites are experimentally 
determined by ChlP-ChIP experiments (Hughes et al., 2000) 
and/ or computationally predicted by statistical algorithms (Matys 
et al., 2006; Bryne et al, 2008; Bailey et al, 2009; Ernst et al, 2010). 
There are a number of databases that contains vast amount of 
information on binding specificities (TFBS) of several TFs and 
TFCs (Matys et al, 2006; Bryne et al, 2008; Bailey et al, 2009). 
However, there are some limitations of incorporating these infor- 
mations as prior knowledge into a network inference algorithm. 
First, the binding specificities are known only for a fraction of all 
TFs and TFCs that are found in nature. For a large number of 
TFs and TFCs, such information is unavailable. It is challenging 
to interpret the unavailability of information in an unambiguous 
manner. For instance, it is difficult to determine whether the lack 
of information represents absence of interaction or simply lack 
of knowledge about the presence of interaction. Second, TFs may 
indirectly regulate genes by forming protein complexes (TFCs) 
with other TFs which directly bind to gene promoters. Many of 
these indirect regulations are not well characterized, further con- 
tributing to the incompleteness of prior knowledge regarding gene 
regulation. 

To address the above issues, I propose a simple scheme of 
incorporating available knowledge into the prior distribution of 
A'. The proposed prior distribution favors potential regulatory 
interactions supported by TFBS data available in public databases. 
However, it does not exclude the possibility of potentially new 
interactions that are not supported by external sources. Further- 
more, it uses information regarding protein interactions among 
the TFs to determine potential indirect gene regulations. These 
indirect regulatory interactions, along with the TFBS specificities, 
are then collectively used as potential regulatory interactions in 
the formulation of the prior distribution of A'. A step-by-step 
description of using external data sources to formulate the prior 
distribution of A' is shown below. 

Step 1: First, TFBS information are collected from multi- 
ple external sources, e.g. public databases such as HTRIDB 
(Bovolenta et al, 2012), ENCODE (Hughes et al, 2000), KEGG 
(Ogata et al., 1999), ConsensusPathDB (Kamburov et al, 2011), 
etc., published literature (Ernst et al., 2010), computational 
TFBS prediction services such as MEME (Bailey et al, 2009), 
TRANSFAC (Bryne et al, 2008), JASPER (Matys et al, 2006), 
TRED (Jiang et al., 2007), etc. 

Step 2: Next, information regarding PPIs among known TFs are 
obtained from publicly available sources. Recently, Ravasi et al. 
(2010) determined a comprehensive map of physical interac- 
tions among mammalian TFs using mammalian two-hybrid 
system. They identified around 800 protein interactions among 
human and mouse TFs. Arguably, this dataset is the most reliable 
source of information regarding protein interactions among 
TFs and is used in the large-scale CRN inference study later in 
this study. However, Ravasi et al.'s study does not cover all mam- 
malian TFs, in which case proteins interaction databases such as 
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STRING (Szklarczyk et al, 201 1), HPRD (Keshava Prasad et al, 
2009), IntAct (Kerrien et al, 2012), BIND (Bader et al, 2003), 
KEGG (Ogata et al, 1999) is used to determine PPI between 
TPs. It should be noted that many of these databases store 
functional and computationally predicted PPIs which may not 
always represent physical protein bindings. Since, we are inter- 
ested in physical interactions among TPs, only physical PPIs 
are carefully selected from the above databases, functional and 
computationally predicted PPIs are excluded from the list of 
potential TF-TF protein interactions. 

Step 3: The above information is then used to build a prior 
network that contains both direct and indirect regulations sup- 
ported by external data. Potential direct regulations are iden- 
tified using TFBS information as described above (see Step 
1). Potential indirect regulations are identified based on the 
assumption that if a TF binds to another TF which targets a 
certain gene, then the former indirectly regulates the target of 
the later (Figure 1). Both direct and indirect regulations are 
incorporated in the prior network as potential transcriptional 
interactions. The prior network is represented by a weighted 
adjacency matrix ( F) . The non-zero elements of this matrix rep- 
resent potential transcriptional regulations supported by TFBS 
and PPI data. The value of a non-zero element (Fy 0) repre- 
sents our confidence on the regulation of a gene (gi) by another 
igj). In this study, equal confidence is placed on all potential 
transcriptional regulations that are supported by TFBS and PPI 
data, i.e., Fy = if gene g, has a TFBS for gj or any of its binding 
partners. Here, is called the confidence parameter. The xth 
row (F') of the prior adjacency matrix (F) represents our prior 
knowledge about the regulatory program of gene gi and is used 
to formulate the prior distribution of the binary vector A' in the 
following manner. 

P(A') oc expiV^A') : A' e ?t 
= 0 otherwise. (5) 

The above prior distribution ensures that the prior probability 
of A' e depends only on the number of interactions (Ay= 1) 
which are supported by prior information (Fy = Uc). This implies 
that if two different configurations of A' have different numbers of 
potentially new interactions (Ay = 1, Fy = 0) but the same num- 
ber of previously known interactions (Ay = 1, Fy = o(c), then these 
two configurations have the same prior probability. Therefore, 
the above prior distribution (Eq. 5) favors regulatory programs 
(configurations of A') that have large number of known interac- 
tions (Fy = a(;) but does not penalize the presence of previously 
unknown interactions, allowing such interactions to be seamlessly 
inferred by the variable selection algorithm. 

As a proof of concept, I implemented the above BVS algorithm 
to reconstruct a liver-specific transcription regulatory network by 
analyzing perturbation response data. To show the effectiveness 
of integrating TFBS and PPI data in the BVS framework, I used 
four different prior settings for A'. In the first setting, no exter- 
nal data source was used to formulate the prior distribution of A' 
and all possible regulatory programs (configurations of A') were 
considered equally likely a priori. In the second setting, no exter- 
nal data sources were used, but the prior distribution of A' was 
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FIGURE 1 1 Constructing prior transcription regulatory network using 
TFBS and PPI data. 



designed to favor sparse regulatory programs, i.e., the configura- 
tions of A' which has relatively fewer non-zero elements than zero 
elements. This approach is similar to that we adopted in our previ- 
ous work (Santra et al, 2013). In the third setting, a prior network 
was constructed using only direct regulatory interactions that were 
predicted from publicly available TFBS information. This prior 
network was then used to formulate the prior distribution of A' as 
shown in Eq. 5. In the final setting, I used both direct and indirect 
regulatory interactions that were predicted from both TFBS and 
PPI interaction data to construct the prior network. This prior net- 
work was then used to formulate the prior distribution of A'. The 
results of the above analysis are described in detail in the following 
section. 

INFERRING LIVER-SPECIFIC GENE REGULATORY NETWORK 
FROM PERTURBATION RESPONSE DATA 

Genes that play key roles in liver development, physiology, and 
disease are found to be tightly regulated by a handful of TPs, such 
as hepatocyte nuclear factors (HNFIA, HNFIB, HNF3A, HNF3B, 
HNF3G, HNF4A, HNF4G, and ONECUTl), CCAAT/enhancer- 
binding proteins (CEBPA and CEBPB), peroxisome proliferator 
activated receptors (PPARA, PPARD, and PPARG), retinoic acid 
receptors (RARA, RARB, and RARG), retinoid receptors (RXRA, 
RXRB, and RXRG), and RAR-related orphan receptors (RORA 
and RORC) (Schrem et al., 2002, 2004; Odom et al., 2004, 2006; 
Tomaru et al, 2009). The genes that encode these TPs are known 
to transcriptionally regulate each other to maintain a particular 
sequence of events leading to the normal development of liver tis- 
sues (Schrem et al, 2002, 2004; Odom et al, 2004, 2006; Tomaru 
et al., 2009). Therefore, uncovering the GRN involving the above 
genes is a fundamental step in understanding the physiological 
processes of liver development. For this purpose, Tomaru et al. 
(2009) perturbed the above GRN, one gene at a time, using siR- 
NAs and measured the steady-state expression levels of these genes 
after each perturbation. Here, these measurements were used to 
infer the topology of the above GRN. 

As mentioned above, four different versions of the aforemen- 
tioned BVS framework were used for network inference, each with 
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a different prior distribution of A'. In the first case, all configu- 
rations of A' were assumed to have equal prior probability, i.e. 
P(A') = y, where y is a constant. 

In the second case, the prior distribution of A' was designed 
to assign higher probabihties to those configurations of A' which 
have fewer ones than zeroes. For this purpose. A' was assumed to 
have a beta binomial distribution, 

. ,x ^ /„A B(„, + a.„,-n, + p) 
V / \ni) B(a,p) 

Here, «r is the number of potential regulators in gene gi. When 
all genes in the network are considered to be the potential regu- 
lators of nr = « — 1. The values of the shape parameters (a, P) 
were kept the same as those used in our previous work (Santra 
etal.,2013),i.e.a=l,p = 2. 

In the third setting, only TFBS information was used to 
construct the prior network (Figure 2A). TFBS information 
were collected from HTRIDB (Bovolenta et al, 2012), MEME 
(Bailey et al., 2009), TRANSFAC (Bryne et al., 2008), JASPER 
(Matys ct p!., 2006), TRED (Jiang ct al., 2007), and SABioscience 
(www.sabiosciences.com). Here, only those TFBS that were found 
within a 5000 bp region of the gene promoters were included 
in the analysis. This resulted in a total of 106 potential tran- 
scriptional regulations (excluding autoregulations, see Table SI 
in Supplementary Material for details) among the 21 TFs men- 
tioned above. These regulatory interactions were represented by a 
prior adjacency matrix (Ftfbs) whose non-zero elements repre- 
sent potential gene regulations and are assigned a value of ttc = 2. 
The ith row (r^pgj) of this matrix (Ftfes) represents our prior 
knowledge on the regulatory program of the ith gene gi, based 
solely on TFBS information, and was used to formulate the prior 
distribution of A'. 

In the fourth setting, both TFBS and PPI among TFs (Figure 2B; 
Table S2 in Supplementary Material) were used to determine 
potential gene regulations. The TFBS information was collected 
as described above. Information regarding PPIs among the above 
TFs was obtained from STRING (Szklarczyk et al, 2011) and 
HPRD (Keshava Prasad et al., 2009) databases (Table S2 in Sup- 
plementary Material). The TFBS and PPIs were used to determine 
potential direct and indirect regulatory interactions as described 
in the previous section (see Figure 1). These resulted in a total of 
217 potential gene regulatory interactions (excluding autoregula- 
tions; see Table S3 in Supplementary Material for details) which 
were used to construct the prior network matrix (Ftpbs-hpfi)- 
The non-zero elements of this matrix (Ftfbs+ppi) were assigned 
a value of = 2. The rows of the prior matrix (Ftfbs-i-ppi) were 
then used to formulate the prior distributions P(A'), j= 1, . . ., n. 

In all the above cases, the search space for the MCMC sampler 
was restricted to t,k, where the subscript k represents the upper 
limit on the number of regulators for each gene. The value of k 
was chosen to be the same as the average number of regulators per 
gene ~ lO) in the prior network (Fxfbs-i-ppi) constructed 
from TFBS and PPI data. 

The GRNs reconstructed using the above prior settings were 
then compared to a gold standard network (GSN) which was 
deduced by Tomaru et al. (2009) using matrix RNAi combined 



with rt-qPCR and Chromatin Immunoprecipitation (X-ChIP) 

experiments (see Figure SI in Supplementary Material). To recon- 
struct the GSN, Tomura et al. knocked down 19 of the above genes, 
one at a time, and measured the responses of these genes to each 
knockdown. If a gene responded to the knockdown of another, 
then the former was considered to be potentially regulated by the 
later. Based on this assumption, a set of potential gene regula- 
tory interactions (Grnai) were determined. This was followed by 
X-ChlP/qPCR analysis that determined the DNA binding pref- 
erences of six (TCFl, FOXAl, FOXA2, HNF4A, ONECUTl, and 
RXRA) of the above TFs. If a TF was found on the promoter of 
a target gene in the X-ChIP experiment, then the later was con- 
sidered to be potentially regulated by the former. A second set of 
potential gene regulations (Gxchip) were identified based on the 
X-ChIP measurements. The set of interactions (Gref) that were 
common to both the above networks (Grnai and Gxchip) were 
then considered to represent the GSN (Gref = Grnai H Gxchip)- 
The networks inferred by the proposed BVS frameworks with dif- 
ferent prior setting were then compared with the above GSN. Since 
the GSN contains information regarding the regulatory activities 
of only six (out of 2 1 ) TFs, I compared only the interactions involv- 
ing these TFs. The activities of the remaining 15 TFs were excluded 
from the comparison. 

Recall that the proposed BVS algorithm uses MCMC sampling 
to estimate the posterior interaction probabilities. These poste- 
rior probabilities represent the a posteriori confidence on each 
interaction based on the perturbation response, TFBS and PPI 
data. If the posterior probabiUty of an interaction is higher than 
a certain threshold (pth); then the corresponding interaction is 
considered to be a true interaction. On the other hand, if a pos- 
terior probability is less than or equal to this threshold, then the 
corresponding interaction is thought to be absent in the CRN. 
This implies that the topology of the reconstructed GRN depends 
on the threshold probability {pxh) and therefore, any comparison 
between the reconstructed GRN and the true GRN also depends 
on the choice of this threshold. For a more objective assessment, 
multiple GRNs are constructed from the above posterior prob- 
abihties using multiple different thresholds. Each reconstructed 
GRN is then compared with the true GRN and the number of 
correctly and incorrectly inferred interactions are counted. These 
counts are used to calculate the true positive rates (TPRs), false 
positive rates (FPRs), and precisions (PREs) of the reconstructed 
GRNs. The TPR is the ratio of total number of the correctly iden- 
tified interactions to the total number of interactions present in 
the GSN (Fawcett, 2004; Powers, 2011); the FPR is the ratio of 
the total number of incorrectly identified interactions and the 
total number of possible interactions that are absent in the GSN 
(Fawcett, 2004; Powers, 201 1); PRE is the ratio of the total number 
of correctly identified interactions to the total number of inter- 
actions present in the inferred network. Then, the TPRs (7-axis) 
are plotted against the FPRs (X-axis), and the PREs (F-axis) are 
plotted against TPRs (X-axis) in two separate plots, commonly 
known as receiver operating characteristic (ROC) and precision 
recall (PR) curves, respectively (Ta^vcctt, 2004; Po^vers, 2011). The 
areas under these curves, denoted by AUROC and AUPR, give an 
objective assessment of the accuracy of the GRNs reconstructed by 
the BVS algorithms (Fawcett, 2004; Powers, 2011). Both AUROC 
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FIGURE 2 I Integrating external data to infer liver-specific transcription 
regulatory network (Tomaru et al., 2009). (A) Prior network constructed 
fromTFBS information. (B) PPI among transcription factors. (C) Network 
inferred by flat prior (P(/l') = y). (D) Network inferred using sparse prior. 
(E) Network inferred using prior network constructed fromTFBS information 
only. (F) Network inferred using prior network constructed fromTFBS and PPI 



information. The interactions that occur with high and low posterior 
probabilities are represented by darker/thicker and lighter/thinner edges, 
respectively, in (C), (D), (E), and (F). (G) Average ROC curves of the inferred 
networks. (H) Average PR curves of the inferred networks. (I) mean and 
standard deviation of the area under the ROC and PR curves of the inferred 
networks. 



and AUPR can have values between 0 and 1, and the closer these 
values are to 1, the better is the accuracy of the inferred networks, 
with AUROC = 1 and AUPR = 1 being the ideal case. To perform a 
robust comparison, the proposed BVS algorithm was executed 50 



times under each prior setting, producing 50 posterior networks 
for each prior network (see Figures 2C-F for sample posterior 
networks inferred from different priors). ROC, PR curves, and the 
areas under these curves (AUROC and AUPR, respectively) were 
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calculated from each posterior network. The average ROC and PR 
curves of the networks that were inferred from the same network 
prior was then calculated for each prior setting (Figures 2G,H). 
The mean and standard deviations of the corresponding AUROC 
and AUPR values, calculated under different prior settings, are 
shown in Figure 21. The AUROC values calculated under differ- 
ent prior settings were then compared using Mann-Whitney U 
test (Mann and Whitney, 1947) to assess the effects of different 
network priors on the accuracy of the proposed BVS algorithm. 
These results suggest that the BVS framework that incorporates 
both the TFBS and PPI data performed better than those which 
incorporate no prior information (p = 0.99 x 10^*), only TFBS 
information (p = 2.05 x 10^'') as prior knowledge, and the sparse 
prior {p = 2.4 x 10^*). These results support our hypothesis that 
TFBS and PPI data can be collectively more predictive of potential 
GRNs than TFBS data alone. 

Finally, I assessed the sensitivity of the BVS framework to the 
confidence parameter (a^) by looking at the agreement between 
results obtained under different values of this parameter. For this 
purpose, five different values (a^ = 1> 2, 3, 4, 5) of the confidence 
parameters were used to formulate a total of 10 prior distributions, 
five of these use only TFBS information and the remaining five use 
both TFBS and PPI information. A GRN was reconstructed using 
each of these prior distributions, leading to 10 inferred networks. 
These networks were then compared with each other to determine 
whether different values of the confidence parameter (ac) had 
significant effect on the network inference process. The inferred 
networks were then compared with the networks inferred from 
no prior knowledge (NPK) and sparse priors, the prior networks 
(riFBS) Ttpbs+ppi)) and the reference network (REF). Pearson 
correlation coefficient was used for comparing these networks. 
The resulting correlation coefficients are shown in Figure 3. Val- 
ues close to unity indicate high degree of similarities between 
networks. The networks inferred from the same type of prior dis- 
tribution are in close agreement with each other, despite different 
values of the confidence parameter a^. This suggests that the pro- 
posed BVS framework is relatively insensitive to different values 
of He. However, the networks inferred from different types of pri- 
ors are mostly different from each other. Additionally, the inferred 
networks are also considerably different from the prior networks 
suggesting that the proposed Bayesian framework indeed strikes a 
balance between prior information and observed data. 

Encouraged by the above results, I implemented the proposed 
BVS framework to infer the regulatory mechanisms of the human 
breast epithelium and compared its performance with a state-of- 
the-art network inference method, which relies on LASSO regres- 
sion. The results of this comparison are described in detail in the 
next section. 

INFERRING GRN OF HUMAN BREAST EPITHELIUM AND 
COMPARISON WITH LASSO 

For large-scale GRN inference, I used a set of mRNA expression 
measurements obtained from human epithelium at different stages 
of cancer development (Graham et al., 2010). The dataset was pro- 
duced by Graham et al. (2010) who performed gene expression 
analysis of breast epithelium tissue samples obtained from 42 
patients (18 cancer free, 18 had prophylactic mammoplasty. 



and 6 had reduction mammoplasty) in order to understand the 
differences in expression profiles of histologically normal breast 
epithelium and usual-risk controls undergoing reduction mam- 
moplasty. These expression profiles were used to infer the GRN 
that governs the regulatory mechanisms of human breast epithe- 
lium. The natural genetic variations caused by SNP, copy number 
variations, mutation, epigenetic regulation, etc., were considered 
to be genetic perturbations that led to different gene expression 
profiles among different patients. To save computational time, only 
top 2000 probe sets (1337 genes) with the highest between-sample 
variances were selected (Table S4 in Supplementary Material). 
Among the selected probes, there were 93 known TFs (Table S5 in 
Supplementary Material) which were used as potential regulators 
of the selected genes for network inference. 

Four different prior settings were used for the BVS framework. 
The parameter settings for the flat and sparse priors were left the 
same as before. TFBS information were collected from ENCODE 
(Hughes et al., 2000; Ernst et al., 2010), MEME (Bailey et al, 2009), 
TRANSFAC (Bryne et al., 2008), and JASPER (Matys et al, 2006) 
to construct the prior network (Ftfbs) that contains only direct 
gene regulations (Figure 4A). This network (Ftfes) contains 4963 
number of potential gene regulations between 93 TFs and 1317 
target genes (Table S6 in Supplementary Material). Information 
regarding PPI among TFs (Figure 4B) was collected from physi- 
cal TF binding data published by Ravasi et al. (2010) (Table S7 in 
Supplementary Material). This information along with the TFBS 
data were used to construct a second prior network (Ftpbs+ppi) 
which contains 16,372 potential regulatory interactions supported 
by both types of data (Table S8 in Supplementary Material). The 
confidence parameter (a^) was set to 2 and the restriction para- 
meter [k) were assigned a value of 12 (A: = ^j^^ ~ 12). The 
above prior settings, when used with the proposed BVS frame- 
work led to four different posterior networks that were then used 
for performance evaluation and comparison purposes. 

For performance comparison, a LASSO regression-based GRN 
inference algorithm (Wang et al, 2013) was selected due to recent 
popularity of LASSO-based methods in the network inference 
community (van Someren et al., 2003; Li and Yang, 2004; van 
Someren et al., 2006; Shimamura et al, 2007; Hecker et al., 2009, 
2012; Lee et al., 2009; Charbonnier et al, 2010; Gustafsson and 
Hornquist, 2010; James et al, 2010; Pan et al, 2010; Peng et al, 
2010; Wang et al., 2013). LASSO is a regularized version of least- 
square regression which uses the constraint that 1 1 P 1 1 ^ the L ' -norm 
of the regression coefiicients, is no greater than a given value. 
This is equivalent to an unconstrained minimization of the least- 
squares penalty with an added penalty Xlipil^, where X is a con- 
stant. As the penalty is increased, LASSO regression drives more 
and more of the regression coefficients (P) to 0, leaving fewer and 
fewer non-zero coefficients. Both LASSO and BVS share some sim- 
ilarities in their core formulations but differ in some key aspects 
in their implementations. For instance, both these algorithms rely 
on linear regression models, but LASSO uses absolute shrinkage 
regularization to deal with curse of dimensionality where BVS 
uses MCMC sampling for the same purpose. Therefore, compar- 
ing the results obtained from LASSO- and BVS-based techniques 
may reveal the strengths and weaknesses of algorithms which 
rely on regularization and MCMC sampling. Similar to the BVS 
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FIGURE 3 I The sensitivity of the BVS frameworl< to the confidence 
parameter (aj. Here, REF represents the reference/gold standard 
network. TFBS represents the prior network that uses only TFBS 
information. TFBS + PPI represents the prior network that uses both TFBS 
and PPI information. No prior knowledge (NPK) represents the network 
that was inferred using flat prior. SPARSE represents the network that was 
inferred using sparse prior. TFBS a = x represents the posterior network 
inferred from TypBs with the confidence parameter set to a, = x. 
TFBS + PPI a = x represents the posterior network inferred from Ttpbs^ppi 
with the confidence parameter set to a, = x. The above heatmap 
represents the similarities (in terms of Pearson correlation coefficients) 



among the reference, prior, and posterior networks. Values close to 1 (dark 
red) represent close agreement and values close to zero (dark blue) 
represent a lack of agreement between network topologies. This figure 
suggests that the prior networks (TFBS and TFBS + PPI) do not have 
significant overlap with the reference network (correlation coefficients 
0.42, 0.31, respectively). This is due to the fact that only 19 and 16% of the 
interactions that are present in the prior networks (TFBS and TFBS + PPI) 
are also present in the reference network (REF). Additionally, posterior 
networks inferred from the same prior network have a high degree of 
topological similarity (correlation coefficients 0.6-0.95), regardless of the 
value of the confidence parameter (aj. 



framework, three different prior settings were used for the LASSO- 
based algorithm. In tlie first case, no prior information was used, 
and in the second and third cases, Ftfbs and Ftpbs+ppi were 
used, respectively, as prior networks. The values of the regular- 
ization parameters were kept at their default values (Xi=0.2, 
\.2 = 0.8). This led to three different networks that were inferred 
by the LASSO-based algorithm. 

To evaluate the accuracy of the inferred networks, I compared 
these to a GSN which consists of a collection of 1726 known gene 
regulatory interactions obtained from the HTRIdb, Consensus- 
PathDB and KEGG databases (Figure 4C, see Table S9 in Sup- 
plementary Material for details). The GSN contains interactions 
between only 27 (out of 93) TFs and their target genes. There- 
fore, only the regulatory activities of these 27 TFs were compared 
and the activities of the remaining 66 TFs were excluded from the 



comparison. The comparison was done using ROC and PR curves 
as mentioned in the previous section. The resulting AUROC and 
AUPR values are shown in Figures 4D,E. These results suggest 
that the performance of the proposed BVS algorithm increased 
significantly when prior information was incorporated into the 
inference method. In particular, TFBS and PPI data collectively 
were more predictive of regulatory interactions than TFBS infor- 
mation alone. Moreover, BVS algorithm performed better than 
the LASSO-based method under all circumstances. As in the pre- 
vious section, the performance of BVS algorithm was found not 
to be sensitive (Figure 4F) to different values of the confidence 
parameter (a^). 

A possible reason behind the poor performance of LASSO can 
be low precision of the prior networks. The prior networks used 
in this study have many more interactions (~5000, 16,000) than 
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FIGURE 4 I Reconstructing GRN of human breast epithelium and 
comparison with LASSO. (A) Prior network based onTFBS information. 
(B) PPI among TPs. (C)Tlie gold standard network. (D) AUROCs of LASSO 
and BVS algorithms under different prior settings. (E) AUPRs of LASSO and 
BVS algorithms under different prior settings. (F) Sensitivity of the BVS 



algorithm to the confidence parameter (aj. Here, TFBS represents the prior 
network constructed from TFBS data, TFBS + PPI represents the prior 
network constructed from both TFBS and PPI information, a = 1, 2, 3, 4 
represents the networks inferred from Ttpbs^ppi with confidence parameters 
cic = 1 , 2, 3, 4, respectively. 
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the REF (~1700 interactions) and therefore have very low preci- 
sion. It was shown before that the performance of LASSO ciegrades 
rapidly as the precision of the prior information decreases (Wang 
et al., 2013). Additionally, the above results depend largely on the 
quality of the GSN which is a generic network consisting of the 
interactions involving the selected genes and TFs. This network 
does not necessarily reflect the tissue-specific behavior of the gene 
regulatory programs in breast cancer cells and therefore may not be 
ideal for performance evaluation purposes. However, this network 
was used as gold standard due to unavailability of information 
regarding tissue-specific GRNs. 

DISCUSSION 

In this study, I presented a new approach that incorporates TFBS 
data along with protein interactions among TFs in a BVS frame- 
work to infer GRNs. The main hypothesis behind this approach 
was that integrating protein interactions among TFs with TFBS 
data increases the predictive power of the inference process, espe- 
cially in a variable selection setting. This was demonstrated by 
inferring a liver-specific transcription regulatory network and the 
gene regulation program of human breast epithelium, and eval- 
uating the accuracy of the inferred networks based on known 
interactions. However, there are several shortcomings of the pro- 
posed data integration method. For instance, adding all indirect 
interactions, predicted from TF-TF PPIs, may result in a very large 
number of potential interactions, leading to a very low precision 
prior which may not contribute to the predictive power of the 
inference process. This issue can be mitigated by using informa- 
tion on protein complexes from relevant databases when these 
databases mature. The precision of the prior network can also be 
improved by removing unlikely edges that can be determined by 
other types of data, e.g. eQTL data. 

Moreover, the proposed BVS framework relies on a linear 
regression model of gene regulation. Although linear regression 
models are extensively used by network inference community due 
to ease of implementation, it was recently shown that tree-based 
regression models may be better suitable than linear regression 
models in network reconstruction problems (Huynh-Thu et al, 
2010). Therefore, a possible upgrade of the proposed Bayesian 
framework will be to replace the linear regression-based gene reg- 
ulation models by tree-based regression models. Additionally, in 
this study, I focused mainly on two types of external data sources, 
consensus motif data, and PPI data. There are a plethora of other 
functional genomics data, e.g. GO, SNP, gene orthology, etc., which 
can also be predictive of potential gene regulatory interactions. 
Our next objective is to find a meaningful way of incorporating 
such information into the BVS framework. 
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